Tutorial Outline

What is Numpy

Numpy is a fundamental Python library for scientific computing and for manipulating large arrays of data. Numpy performs mathematical, statistical, and data-manipulation operations on arrays much faster than equivalent pure-Python code. This difference matters when you are performing a large number of calculations on arrays.

The major advantage of using Numpy for handling arrays is that Numpy implements the heavy lifting in C, which is much faster. This lets you tap into the performance of C from the comfort of Python.

Numpy is part of the Scipy ecosystem of libraries, which is used for mathematics, science and engineering. This ecosystem includes:

  • Scipy (Fundamental library for science, mathematics and engineering)
  • Numpy (Multi-dimensional array for efficient math and logic operations)
  • Pandas (Data analysis and DataFrame & Series objects)
  • Jupyter (Web-based interactive development environment)
  • Matplotlib (Charting library)
  • Sympy (Symbolic computation library)

ndarray object

The ndarray (pronounced N D array) object is the main object for representing your array. It can handle multi-dimensional arrays of any size that fits in memory. The biggest differences between a Python list and an ndarray object are these:

  • ndarray is a fixed-size array while a list has a dynamic size. Changing the size of an ndarray creates a new array, and the original is freed once nothing references it.
  • ndarray allows mathematical and logical operations on complete multi-dimensional arrays. With a Python list you have to iterate over the sequence, which takes more time and code.
  • ndarray has a homogeneous data type for the complete array while a Python list can contain multiple data types within a single list.*

Note: You could have multiple data types in a single object that has multiple dimensions.
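
The differences listed above can be seen on a small scale. The following is a minimal sketch; the variable names are illustrative and not part of the original notebook.

```python
import numpy as np

# A Python list can mix types; an ndarray stores a single dtype.
mixed_list = [1, 2.5, 3]
arr = np.array(mixed_list)          # the integers are upcast to float
print(arr.dtype)                    # float64

# Arithmetic applies to the whole array; on a list, + concatenates.
doubled = arr * 2
concatenated = mixed_list + mixed_list

# An ndarray has a fixed size: np.append returns a new array
# and leaves the original untouched.
longer = np.append(arr, 4.0)
print(arr.shape, longer.shape)      # (3,) (4,)
```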

Mathematical & Logical Operation

Import the library

You will have to import Numpy in your code to use it. A common alias for Numpy is np.


In [1]:
import numpy as np

Create arrays

We will perform a number of mathematical operations on a large sequence of 5 million items to compare a Python list with a Numpy ndarray.


In [2]:
#%%timeit
python_list_1 = list(range(5000000))
python_list_2 = list(range(5000000))

In [3]:
#%%timeit
np_array_1 = np.arange(5000000)
np_array_2 = np.arange(5000000)

Adding a fixed number to an array

We will add a fixed number to every item in our array. In Numpy this is very simple: just add the two.


In [4]:
%%timeit
np_array_1 + 7


100 loops, best of 3: 11.4 ms per loop

In plain Python it is more complicated because you have to do it manually with a loop. It is also much slower.


In [5]:
%%timeit
python_output = []
for i in python_list_1:
    python_output.append(i + 7)


1 loops, best of 3: 626 ms per loop

In [6]:
%%timeit
python_output = [i + 7 for i in python_list_1]


1 loops, best of 3: 373 ms per loop
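
To check that the Numpy expression and the Python loop compute the same thing, here is a small sketch on five items instead of five million. The names `small_list` and `small_array` are made up for this illustration.

```python
import numpy as np

small_list = list(range(5))
small_array = np.arange(5)

# The scalar is applied to every element (broadcasting).
added = small_array + 7
scaled = small_array * 3
squared = small_array ** 2

# The list comprehension gives the same values, one item at a time.
added_list = [i + 7 for i in small_list]

print(added)                         # [ 7  8  9 10 11]
print(added.tolist() == added_list)  # True
```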

Adding two lists to each other

You will see the same pattern here. It is very simple and fast to add two arrays in Numpy: just add them.


In [7]:
%%timeit
np_array_1 + np_array_2


100 loops, best of 3: 14.3 ms per loop

In [8]:
%%timeit
python_output = []
for i in range(len(python_list_1)):
    python_output.append(python_list_1[i] + python_list_2[i])


1 loops, best of 3: 1.16 s per loop

In [9]:
%%timeit
python_output = [python_list_1[i] + python_list_2[i] for i in range(len(python_list_1))]


1 loops, best of 3: 874 ms per loop

Multiplying two lists

This is the pointwise (element-wise) product: each value in the first array is multiplied by the value at the same position in the second array. Given two arrays $A$ and $B$ of the same size:

$$ C = A \circ B $$

$$ C_{ij} = A_{ij} \times B_{ij} $$

In [10]:
%%timeit
np_array_1 * np_array_2


100 loops, best of 3: 14.7 ms per loop

In [11]:
%%timeit
python_output = []
for i in range(len(python_list_1)):
    python_output.append(python_list_1[i] * python_list_2[i])


1 loops, best of 3: 1.18 s per loop

In [12]:
%%timeit
python_output = [python_list_1[i] * python_list_2[i] for i in range(len(python_list_1))]


1 loops, best of 3: 914 ms per loop
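
As a side note, the pure-Python version can be written without positional indexing by pairing items with `zip`. The sketch below uses small hypothetical lists to show that this matches the Numpy result, and that Numpy refuses to multiply arrays whose shapes do not match.

```python
import numpy as np

list_a = [1, 2, 3, 4]
list_b = [10, 20, 30, 40]

# zip pairs the items directly, which avoids indexing by position.
product_list = [x * y for x, y in zip(list_a, list_b)]

# The Numpy equivalent is a single expression.
product_array = np.array(list_a) * np.array(list_b)
print(product_array)                # [ 10  40  90 160]

# Arrays whose shapes cannot be matched raise a ValueError.
try:
    np.arange(4) * np.arange(3)
    shapes_matched = True
except ValueError:
    shapes_matched = False
print(shapes_matched)               # False
```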

Matrix Multiplication

This is different from the previous pointwise multiplication. To multiply matrix $A$ by matrix $B$, the following must hold.

$A$ has size $(m,n)$ (rows, columns) and $B$ has size $(o,p)$. Then:

$$A_{m,n} \cdot B_{o,p} = C_{m,p}$$

if $n=o$ (the inner dimensions), resulting in a matrix of size $(m,p)$ (the outer dimensions).


In [13]:
A = np.random.randint(1,10,(3,1))
B = np.random.randint(1,10,(1,3))
print("A:\n==")
print(A)
print("B:\n==")
print(B)


A:
==
[[4]
 [3]
 [6]]
B:
==
[[8 8 7]]

In [14]:
C = A.dot(B)
print(C)


[[32 32 28]
 [24 24 21]
 [48 48 42]]

In [15]:
C = B.dot(A)
print(C)


[[98]]
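
The shape rule can also be checked with non-trivial inner dimensions. This is a minimal sketch with arbitrary example matrices; it uses the `@` operator, which for 2D arrays gives the same result as `.dot()`.

```python
import numpy as np

A = np.arange(6).reshape(2, 3)      # shape (m, n) = (2, 3)
B = np.arange(12).reshape(3, 4)     # shape (o, p) = (3, 4)

# n == o, so the product is defined and has shape (m, p).
C = A @ B                           # same result as A.dot(B)
print(C.shape)                      # (2, 4)

# Swapping the operands gives inner dimensions 4 and 2, which differ.
try:
    B @ A
    product_defined = True
except ValueError:
    product_defined = False
print(product_defined)              # False
```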

Odd or even


In [16]:
np_array_1 % 2 == 0


Out[16]:
array([ True, False,  True, ..., False,  True, False], dtype=bool)

In [17]:
python_output = [i % 2 == 0 for i in python_list_1]

Multi-Dimensional Arrays Operations

We start with a 2D array of size 2000 * 2000 = 4M values, laid out like this:

0 0 0 ... 0
1 1 1 ... 1
...
...
1998 1998 1998 ... 1998
1999 1999 1999 ... 1999

In [18]:
python_list_1 = [[i for l in range(2000)] for i in range(2000)]
python_list_2 = [[i for l in range(2000)] for i in range(2000)]

np_array_1 = np.array(python_list_1)
np_array_2 = np.array(python_list_2)

In [19]:
np_array_1


Out[19]:
array([[   0,    0,    0, ...,    0,    0,    0],
       [   1,    1,    1, ...,    1,    1,    1],
       [   2,    2,    2, ...,    2,    2,    2],
       ..., 
       [1997, 1997, 1997, ..., 1997, 1997, 1997],
       [1998, 1998, 1998, ..., 1998, 1998, 1998],
       [1999, 1999, 1999, ..., 1999, 1999, 1999]])
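
The same layout can be built without going through nested Python lists. The sketch below is one possible way, shown on a 5 x 5 array (`n = 5` is a stand-in for the 2000 used above), and checks that both constructions agree.

```python
import numpy as np

n = 5  # small stand-in for the 2000 used above

# Pure-Python construction, as in the cell above.
rows_list = [[i for _ in range(n)] for i in range(n)]

# Numpy construction: a column of row indices repeated across n columns.
column = np.arange(n).reshape(n, 1)
rows_array = np.repeat(column, n, axis=1)

print(np.array_equal(rows_array, np.array(rows_list)))  # True
```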

Multiply 2D Arrays

This is the pointwise (element-wise) product: each value in the first array is multiplied by the value at the same position in the second array. Given two arrays $A$ and $B$ of the same size:

$$ C = A \circ B $$

$$ C_{ij} = A_{ij} \times B_{ij} $$

In [20]:
%%timeit
np_array_1 * np_array_2


100 loops, best of 3: 8.55 ms per loop

In [21]:
%%timeit
python_output = []
for i in range(len(python_list_1)):
    python_output.append([])
    for l in range(len(python_list_1[i])):
        # append to the last row itself; [-1:] would append to a throwaway copy
        python_output[-1].append(python_list_1[i][l] * python_list_2[i][l])


1 loops, best of 3: 1.95 s per loop

Indexing

1D Indexing


In [22]:
np_array_1 = np.array(range(5000000))

In [23]:
np_array_1[0]


Out[23]:
0

In [24]:
np_array_1[-1]


Out[24]:
4999999

In [25]:
np_array_1[1:10]


Out[25]:
array([1, 2, 3, 4, 5, 6, 7, 8, 9])

In [26]:
np_array_1[1:10:2]


Out[26]:
array([1, 3, 5, 7, 9])

2D Indexing


In [27]:
python_list_1 = [[i+l for l in range(2000)] for i in range(2000)]
np_array_1 = np.array(python_list_1)
np_array_1


Out[27]:
array([[   0,    1,    2, ..., 1997, 1998, 1999],
       [   1,    2,    3, ..., 1998, 1999, 2000],
       [   2,    3,    4, ..., 1999, 2000, 2001],
       ..., 
       [1997, 1998, 1999, ..., 3994, 3995, 3996],
       [1998, 1999, 2000, ..., 3995, 3996, 3997],
       [1999, 2000, 2001, ..., 3996, 3997, 3998]])

In [28]:
np_array_1[2]


Out[28]:
array([   2,    3,    4, ..., 1999, 2000, 2001])

In [29]:
np_array_1[2][0]


Out[29]:
2

In [30]:
np_array_1[2,0]


Out[30]:
2

In [31]:
np_array_1[1:5,0]


Out[31]:
array([1, 2, 3, 4])
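
The indexing forms above are easier to follow on an array small enough to read in full. This sketch uses a hypothetical 3 x 4 array named `grid`.

```python
import numpy as np

grid = np.arange(12).reshape(3, 4)
# [[ 0  1  2  3]
#  [ 4  5  6  7]
#  [ 8  9 10 11]]

row = grid[1]            # second row: [4 5 6 7]
column = grid[:, 2]      # third column: [ 2  6 10]
block = grid[0:2, 1:3]   # rows 0-1, columns 1-2: [[1 2] [5 6]]
last = grid[-1, -1]      # bottom-right element: 11
```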

Filtering

You can find items in a Numpy array using a mask of booleans. The mask (or filter) should have the same shape as the data; indexing with it returns the values where the mask is True.


[Figure: Array Filter]



In [33]:
np_array_1 = np.array(range(10))
print(np_array_1)


[0 1 2 3 4 5 6 7 8 9]

In [34]:
mask = np.array([True, False, False, False, False, False, False, False, False, True])
print(np_array_1[mask])


[0 9]

In [35]:
np_array_1[np_array_1 < 5]


Out[35]:
array([0, 1, 2, 3, 4])

In [36]:
np_array_1[np_array_1 % 2 == 0]


Out[36]:
array([0, 2, 4, 6, 8])
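
Masks can be combined, and they can be used for assignment as well as selection. The following is a small sketch with an illustrative array named `values`.

```python
import numpy as np

values = np.arange(10)

# Combine conditions with & (and), | (or), ~ (not).
# The parentheses are required because & binds tighter than <.
small_and_even = values[(values < 5) & (values % 2 == 0)]
print(small_and_even)               # [0 2 4]

outside = values[(values < 2) | (values > 7)]
print(outside)                      # [0 1 8 9]

# A mask can also select the elements to overwrite.
clipped = values.copy()
clipped[clipped > 6] = 6
print(clipped)                      # [0 1 2 3 4 5 6 6 6 6]
```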

Applying a function

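To apply a scalar Python function to every element of a Numpy array, you can wrap it with np.vectorize. Note that np.vectorize is a convenience: it still calls the Python function once per element, so it does not give the speed of Numpy's built-in operations.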


In [37]:
def isprime(n):
    '''check if integer n is a prime'''
    # make sure n is a positive integer
    n = abs(int(n))
    # 0 and 1 are not primes
    if n < 2:
        return False
    # 2 is the only even prime number
    if n == 2: 
        return True    
    # all other even numbers are not primes
    if not n & 1: 
        return False
    # range starts at 3 and only needs to go up to the square root of n,
    # checking odd numbers only
    for x in range(3, int(n**0.5)+1, 2):
        if n % x == 0:
            return False
    return True

visprime = np.vectorize(isprime)

In [38]:
np_array_1[visprime(np_array_1)]


Out[38]:
array([2, 3, 5, 7])
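
To make clear what the wrapper does, the sketch below compares np.vectorize with an explicit loop. The function `is_small_prime` is a hypothetical stand-in used only for this illustration.

```python
import numpy as np

def is_small_prime(n):
    # Hypothetical scalar function used only for this illustration.
    return n in (2, 3, 5, 7)

values = np.arange(10)

# np.vectorize calls the Python function once per element.
# (It may make one extra call up front to infer the output type.)
vectorized = np.vectorize(is_small_prime)
mask = vectorized(values)

# An explicit loop over the elements builds the same mask.
loop_mask = np.array([is_small_prime(v) for v in values])

print(values[mask])                     # [2 3 5 7]
print(np.array_equal(mask, loop_mask))  # True
```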

Exploring ndarray object

Exploring booleans with all() and any()


In [39]:
np_array_1 = np.random.randint(1,10, size=(500000,))
np_array_1


Out[39]:
array([6, 7, 4, ..., 6, 8, 1])

In [40]:
np_array_1 % 2 == 0


Out[40]:
array([ True, False,  True, ...,  True,  True, False], dtype=bool)

In [41]:
(np_array_1 % 2 == 0).all()


Out[41]:
False

In [42]:
(np_array_1 % 2 == 0).any()


Out[42]:
True
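
On multi-dimensional arrays, all() and any() also accept an axis argument. A minimal sketch with a made-up 3 x 3 table:

```python
import numpy as np

table = np.array([[2, 4, 6],
                  [1, 4, 5],
                  [3, 5, 7]])
is_even = table % 2 == 0

print(is_even.any())        # True: at least one even value
print(is_even.all())        # False: not every value is even

# axis=1 reduces across the columns, giving one answer per row.
print(is_even.all(axis=1))  # [ True False False]
print(is_even.any(axis=1))  # [ True  True False]
```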

Exploring Basic Stats


In [43]:
np_array_1.max()


Out[43]:
9

In [44]:
np_array_1.min()


Out[44]:
1

In [45]:
np_array_1.mean()


Out[45]:
5.0037859999999998

In [46]:
np_array_1.std()


Out[46]:
2.580650240967187
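
The same methods can work per row or per column through the axis argument. The sketch below uses a small hypothetical array so the results can be verified by hand.

```python
import numpy as np

scores = np.array([[1, 2, 3],
                   [4, 5, 6]])

print(scores.mean())          # 3.5
print(scores.mean(axis=0))    # per column: [2.5 3.5 4.5]
print(scores.max(axis=1))     # per row: [3 6]

# std() is the population standard deviation (ddof=0) by default;
# pass ddof=1 for the sample standard deviation.
population_std = scores.std()
sample_std = scores.std(ddof=1)
```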

Exploring data type, shape and memory use


In [47]:
np_array_1.dtype


Out[47]:
dtype('int64')

In [48]:
np_array_1.shape


Out[48]:
(500000,)

In [49]:
np_array_1.ndim


Out[49]:
1

In [50]:
np_array_1.size * np_array_1.itemsize


Out[50]:
4000000

In [51]:
"MB: %0.1f" % (_ / 1024 / 1024)


Out[51]:
'MB: 3.8'
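
The size * itemsize product is also available directly as the nbytes attribute, and the dtype determines how much memory each item takes. A short sketch, with the dtype set explicitly so the numbers do not depend on the platform:

```python
import numpy as np

counts = np.ones(500000, dtype=np.int64)

# nbytes equals size * itemsize.
print(counts.itemsize)                                 # 8
print(counts.nbytes == counts.size * counts.itemsize)  # True

# Values in the range 1-9 fit in a single byte.
compact = counts.astype(np.int8)
print(compact.nbytes)                                  # 500000
```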

Manipulating data type and shape


In [52]:
np_array_1.astype(float)


Out[52]:
array([ 6.,  7.,  4., ...,  6.,  8.,  1.])

In [53]:
np_array_1.reshape(500, 1000)


Out[53]:
array([[6, 7, 4, ..., 5, 8, 3],
       [4, 7, 6, ..., 1, 6, 1],
       [6, 6, 3, ..., 5, 5, 8],
       ..., 
       [5, 1, 8, ..., 9, 2, 5],
       [1, 9, 5, ..., 2, 8, 1],
       [7, 8, 1, ..., 6, 8, 1]])
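
Two details of reshape are worth knowing: one dimension can be left for Numpy to infer, and the result usually shares memory with the original array. The sketch below shows both on a small array with illustrative names.

```python
import numpy as np

flat = np.arange(12)

# -1 asks Numpy to infer that dimension from the total size.
grid = flat.reshape(3, -1)
print(grid.shape)                   # (3, 4)

# The element count must match, otherwise a ValueError is raised.
try:
    flat.reshape(5, 5)
    reshaped = True
except ValueError:
    reshaped = False

# For a contiguous array like this one, reshape returns a view:
# writing through the view changes the original data.
grid[0, 0] = 99
print(flat[0])                      # 99
```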

In [54]:
np_array_1.reshape(100, 1000, 5)


Out[54]:
array([[[6, 7, 4, 2, 9],
        [7, 9, 5, 7, 5],
        [1, 2, 3, 5, 3],
        ..., 
        [5, 7, 1, 9, 7],
        [3, 9, 8, 5, 8],
        [5, 6, 4, 4, 9]],

       [[6, 4, 8, 7, 8],
        [6, 3, 5, 2, 5],
        [5, 5, 1, 4, 6],
        ..., 
        [4, 4, 1, 5, 5],
        [4, 3, 9, 6, 4],
        [8, 1, 2, 1, 9]],

       [[7, 2, 3, 1, 9],
        [8, 7, 5, 2, 8],
        [9, 3, 3, 7, 3],
        ..., 
        [7, 4, 4, 8, 7],
        [4, 6, 7, 4, 5],
        [9, 9, 8, 9, 2]],

       ..., 
       [[3, 4, 1, 7, 4],
        [8, 9, 3, 4, 8],
        [9, 8, 4, 1, 1],
        ..., 
        [6, 6, 4, 5, 9],
        [7, 1, 4, 1, 9],
        [2, 1, 5, 5, 7]],

       [[2, 9, 1, 8, 5],
        [7, 2, 5, 5, 4],
        [3, 4, 9, 6, 4],
        ..., 
        [7, 9, 8, 2, 6],
        [4, 6, 3, 3, 8],
        [9, 2, 6, 9, 9]],

       [[3, 8, 5, 5, 4],
        [7, 9, 8, 3, 9],
        [5, 1, 4, 4, 4],
        ..., 
        [5, 5, 8, 9, 9],
        [2, 3, 5, 5, 3],
        [6, 4, 6, 8, 1]]])

[Figure: 3D Array]


Array Creation


In [56]:
np.array(range(10))


Out[56]:
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [57]:
np.zeros((10,), dtype=int)


Out[57]:
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

In [58]:
np.ones((10,))


Out[58]:
array([ 1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.])

In [59]:
np.arange(4, 14)


Out[59]:
array([ 4,  5,  6,  7,  8,  9, 10, 11, 12, 13])

In [60]:
np.linspace(1, 10)


Out[60]:
array([  1.        ,   1.18367347,   1.36734694,   1.55102041,
         1.73469388,   1.91836735,   2.10204082,   2.28571429,
         2.46938776,   2.65306122,   2.83673469,   3.02040816,
         3.20408163,   3.3877551 ,   3.57142857,   3.75510204,
         3.93877551,   4.12244898,   4.30612245,   4.48979592,
         4.67346939,   4.85714286,   5.04081633,   5.2244898 ,
         5.40816327,   5.59183673,   5.7755102 ,   5.95918367,
         6.14285714,   6.32653061,   6.51020408,   6.69387755,
         6.87755102,   7.06122449,   7.24489796,   7.42857143,
         7.6122449 ,   7.79591837,   7.97959184,   8.16326531,
         8.34693878,   8.53061224,   8.71428571,   8.89795918,
         9.08163265,   9.26530612,   9.44897959,   9.63265306,
         9.81632653,  10.        ])

In [61]:
np.linspace(1, 10, 37)


Out[61]:
array([  1.  ,   1.25,   1.5 ,   1.75,   2.  ,   2.25,   2.5 ,   2.75,
         3.  ,   3.25,   3.5 ,   3.75,   4.  ,   4.25,   4.5 ,   4.75,
         5.  ,   5.25,   5.5 ,   5.75,   6.  ,   6.25,   6.5 ,   6.75,
         7.  ,   7.25,   7.5 ,   7.75,   8.  ,   8.25,   8.5 ,   8.75,
         9.  ,   9.25,   9.5 ,   9.75,  10.  ])
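
The 0.25 spacing above follows from how linspace divides the interval. A minimal sketch, using the retstep option to return the spacing and comparing with arange:

```python
import numpy as np

# linspace includes both endpoints, so the spacing is
# (stop - start) / (num - 1).
points, step = np.linspace(1, 10, 37, retstep=True)
print(step)                         # 0.25

# arange takes the step directly and excludes the stop value.
stepped = np.arange(1, 10, 0.25)
print(len(points), len(stepped))    # 37 36
```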

Log Space

np.logspace(start, stop, num=50, endpoint=True, base=10.0, dtype=None)

Log space creates num values between $base^{start}$ and $base^{stop}$. By default, the base is 10.


In [62]:
np.logspace(1, 5, 5)


Out[62]:
array([  1.00000000e+01,   1.00000000e+02,   1.00000000e+03,
         1.00000000e+04,   1.00000000e+05])

In [63]:
np.logspace(-1, 5, 7)


Out[63]:
array([  1.00000000e-01,   1.00000000e+00,   1.00000000e+01,
         1.00000000e+02,   1.00000000e+03,   1.00000000e+04,
         1.00000000e+05])

In [64]:
np.logspace(-1, 5, 7, base=2)


Out[64]:
array([  0.5,   1. ,   2. ,   4. ,   8. ,  16. ,  32. ])
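
Put differently, logspace is linspace applied to the exponents. The following sketch checks that equivalence for the base 2 example above.

```python
import numpy as np

exponents = np.linspace(-1, 5, 7)       # [-1.  0.  1.  2.  3.  4.  5.]

# logspace raises the base to evenly spaced exponents.
powers_of_two = np.logspace(-1, 5, 7, base=2)
by_hand = 2.0 ** exponents

print(np.allclose(powers_of_two, by_hand))   # True
```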

Numpy array as an image

Import matplotlib

First we will import a library for showing images and charts called matplotlib (The next tutorial in this series is about it).


In [65]:
import matplotlib.pyplot as plt
%matplotlib inline

Creating an RGB image

Now we will create a 3D array of size 50 x 50 x 3 which will represent an image of size 50px by 50px with RGB (Red Green Blue) values for each pixel.


In [66]:
randim_image = np.random.rand(50,50,3)
plt.imshow(randim_image, interpolation="nearest")


Out[66]:
<matplotlib.image.AxesImage at 0x7f07d2cafc50>

Set Red to 0 (remove the red color)


In [67]:
randim_image[:,:,0] = 0
plt.imshow(randim_image, interpolation="nearest")


Out[67]:
<matplotlib.image.AxesImage at 0x7f07d2b8e0b8>

Divide Blue by 3 (reduce blue to one third)


In [68]:
randim_image[:,:,2] /= 3
plt.imshow(randim_image, interpolation="nearest")


Out[68]:
<matplotlib.image.AxesImage at 0x7f07d2bba208>

Set green in every other row to 1


In [69]:
randim_image[::2,:,1] = 1

plt.imshow(randim_image, interpolation="nearest")


Out[69]:
<matplotlib.image.AxesImage at 0x7f07d0a98240>
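
The channel operations above can be verified numerically without plotting. This is a small sketch on a 4 x 4 image; the seeded generator and the name `image` are assumptions made for reproducibility and are not part of the original notebook.

```python
import numpy as np

# Seeded generator so the sketch is reproducible (the seed is arbitrary).
rng = np.random.default_rng(0)
image = rng.random((4, 4, 3))       # height, width, RGB channels

image[:, :, 0] = 0                  # red channel off everywhere
image[:, :, 2] /= 3                 # blue scaled to one third
image[::2, :, 1] = 1                # green at full intensity on rows 0, 2

print(image[:, :, 0].max())         # 0.0
print(image[0, :, 1])               # [1. 1. 1. 1.]
```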

There is more

Numpy is a large library with many useful out of the box functions like: